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Abstract 

Many methods of gene set analysis developed in recent years have been compared empirically in a number of com- 
prehensive review articles. Although it is recognized that different methods tend to identify different gene sets as 
significant, no consensus has been worked out as to which method is preferable, as the recommendations are 
often contradictory. In this article, we want to group and compare different methods in terms of the methodological 
assumptions pertaining to definition of a sample and formulation of the actual null hypothesis. We discuss four 
models of statistical experiment explicitly or implicitly assumed by most if not all currently available methods of 
gene set analysis. We analyse validity of the models in the context of the actual biological experiment. Based on 
this, we recommend a group of methods that provide biologically interpretable results in statistically sound way. 
Finally, we demonstrate how correlated or low signal-to-noise data affects performance of different methods, 
observed in terms of the false-positive rate and power. 
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INTRODUCTION 

Massive throughput techniques, such as microarrays, 
allow to study genome-wide associations of gene 
expression with diseases or phenotypes. The focus 
in expression data analysis has shifted in recent 
years from single gene to the gene-set level. This 
important change has been motivated biologically, 
as many diseases are believed to be associated with 
modest regulation in a set of related genes rather than 
a strong increase in a single gene [1]. Gene-set ana- 
lysis is also expected to ease common limitations of 
standard single gene studies, such as the difficulty in 
interpretation of multiple testing corrected lists of 
differentially expressed genes, or poor reproducibility 
of important gene lists yielded by independent 
studies [2, 3]. 

A variety of methods for gene-set analysis have 
been developed. These methods incorporate previ- 
ous biological knowledge about the sets of presum- 
ably related genes in the analysis of data. The 
methods can be broadly categorized as 'self- 
contained' or 'competitive' [4]. The former analyses 



association between the phenotype and expression in 
the gene set of interest while ignoring genes not in 
the gene set. Examples include the Globaltest [5, 6], 
the ANCOVA [7] or a number of tests described by 
Fridley etal. [8]. Competitive methods compare the 
gene set with its complement in terms of association 
with the phenotype. Examples include the popular 
GSEA algorithm [1], GSA (gene-set analysis) [9], 
SAFE [10] and Random set methods [11]. 

Although the numerous gene-set analysis methods 
propose different measures of association (statistics), 
they all attempt to use the framework of statistical 
hypothesis testing to assess significance of association 
and, hence, assign a P-value to the gene set. The null 
hypotheses commonly assumed were classified by 
Goeman and Buehlmann [12] as 'self-contained' or 
'competitive'. The self-contained null hypothesis as- 
sumes that no genes in the gene set are associated 
with the phenotype. Obviously, self-contained 
methods are designed to test this hypothesis. The 
competitive null hypothesis assumes that genes in 
the gene set are not more associated with the 
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phenotype than genes outside the gene set. It is com- 
monly taken for granted that competitive methods 
test this hypothesis. We consider this further in this 
article. 

Different procedures are proposed to derive distri- 
bution of the test statistic under the null hypothesis: 
using randomization of samples (phenotype labels) 
[1, 5, 10]; using randomization of genes [13] or 
Tian's Q2 statistic [14]; or using different parametric 
assumptions PAGE [12, 15, 16]. 

We refer the reader to the study conducted by 
Nam and Kim [4] for a comprehensive list of algo- 
rithms and tools available for gene-set analysis. We 
observe that vast majority of the tools reported 
implement competitive algorithms, and roughly 
half of the methods estimate significance by sample 
randomization, with the remaining half based on 
gene randomization or on parametric models. 

A number of review articles have been published 
in recent years comparing performance of the meth- 
ods based on real or simulated data sets [8, 12, 
17—22]. However, they often provide contradictory 
usage guidelines and recommendations as to 
which methods should be preferred. In the import- 
ant methodological article [12], Goeman and 
Buehlmann strongly recommend to use self- 
contained methods, as they provide statistically inter- 
pretable results. In contrast, they argue that methods 
based on gene randomization 'lead to wildly mislead- 
ing interpretations and should be discouraged in the 
strongest terms'. Nam and Kim [4] recommend to 
use GSEA or similar methods, such as GSA or SAFE, 
all of which belong to competitive sample- 
randomization methods. However, for small sample 
sizes, they recommend gene-randomization instead. 
The authors also point that sample-randomization 
methods may be over-powerful, as they tend to 
declare gene sets as significant based on only a few 
differentially expressed genes. Liu etal. [17] empiric- 
ally compared three self-contained methods — 
Globaltest, ANCOVA and SAM-GS— and con- 
cluded that they show similar performance in terms 
of size and power, provided that data are properly 
standardized to stabilize per-gene variance. 

Fridley et al. [8] evaluated a number of self- 
contained methods using an extensive simulation 
study and demonstrated that the Globaltest and the 
Fisher's method aggregating P-values from individual 
genes were the most powerful. 

In their recent study [22], Hung et al. recom- 
mended using enrichment score statistic as defined 



in GSEA [1], or alternatively the Wilcoxon rank- 
sum statistic, with empirical rather than analytical 
null distribution. Their recommendation was based 
on the Mutual Coverage criterion they defined to 
measure the extent to which gene sets deemed sig- 
nificant by one method are reproduced by other 
methods analysed by the authors, based on a collec- 
tion of >100 experimental data sets. 

Ackermann and Strimmer [21] recommend to use 
simple univariate procedures, such as the mean of the 
squared f-statistics of genes in the gene set, with 
either sample or gene randomization. They argue 
against using the popular GSEA method, strongly 
rejecting the recommendation put forward by Nam 
and Kim [4]. Their guidelines are based on a com- 
prehensive simulation study in which they exten- 
sively used gene randomization as the means to 
estimate significance of their findings. 

Irizarry et al. [16] also advise against using the 
GSEA algorithm and propose to use simple paramet- 
ric tests instead: the 2-score and the % 2 test to detect 
changes in location and scale. They argue that their 
approach leads to more powerful and much simpler 
procedure than the popular GSEA. 

Wu and Lin [20] showed that selection of the 
gene-set analysis method matters, as different meth- 
ods are likely to produce different, hardly overlap- 
ping results. For instance, based on the diabetes data 
set [23] , GSEA and Globaltest methods report 1 8 and 
4 significant data sets, respectively, with only 1 gene 
set reported by both methods. 

In this work, we want to analyse methodological 
differences between gene-set analysis methods. We 
divide the methods into four groups: self-contained, 
competitive with sample randomization, competitive 
with gene randomization and parametric. We analyse 
the statistical models underlying these different meth- 
ods and discus whether the models agree with the 
actual experiment that produced the data. We show 
that gene randomization and some parametric meth- 
ods are based on a statistical model that does not 
follow the underlying biological experiment; hence, 
what the methods name as 'significance' or 'P-value' 
should not be interpreted as such. We then propose a 
different interpretation of these procedures, no longer 
based on the statistical hypothesis testing, which leads 
to biologically relevant interpretation of results. 

In this article, we extend the methodological ana- 
lysis of gene-set methods presented by Goeman 
and Buehlmann [12] who clarified fundamental dif- 
ferences between self-contained and competitive 
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hypotheses, and between subject-sampling and gene- 
sampling approaches. They elaborated the um 
model, which underlies the earliest, overrepresenta- 
tion methods of gene-set enrichment. Based on this 
model, they formulated fundamental criticism of 
P-values based on gene sampling. 

The extensions provided in this article are 3-fold. 
First, we clarify the statistical models and null 
hypotheses of most if not all currently available 
methods of gene-set enrichment, and we discuss 
which of the models disagree with the biological 
experiment. We show that popular competitive 
methods do not actually test the competitive null 
hypothesis formulated by Goeman and Buehlmann 
[12]. Second, we propose a different interpretation of 
results of some popular methods based on gene ran- 
domization, which does not violate the organization 
of the biological experiment, but gives up the con- 
cepts of significance or P- values. Third, we present a 
simple simulation study to illustrate performance of 
selected self-contained, comparative and parametric 
methods in terms of size and power, focusing on 
correlated and low signal-to-noise data. 



METHODS 

We introduce the following notation. Let 
W = = 1, . . . ,dj = 1, . . . ,n denote the 

matrix with results of a massive throughput study 
(e.g. gene expression data), with d dimensional vec- 
tors of gene expression measured for n samples. We 
also define Y = (Yj)j = 1, . . . ,n as the (1 x n) target 
vector of class labels for samples. Y often represents 
tumour versus control samples; however, it can also 
contain continuous measurements. We use here the 
convention common in bioinformatics literature 
with samples represented by columns and genes (fea- 
tures) by rows. Let W\. and W.t represent the t'th row 
and the jth column of W. 

We want to analyse the association of a given gene 
set G with the target. Let G represent the set of 
indices of rows of W that correspond to genes in 
the gene set G, and G C — indices of the remaining 
genes (i.e. complement of G). Let m = |G| be the 
number of elements in G. 

It is convenient to represent the matrix of gene 
expressions for genes in G as X = (Wj.),i G G, so 
that X — (xij),i = 1, . . . ,mj = 1, . . . ,n. Similarly, 
we define X c: = (Wi),v£G. 

We also define as t = (f,),t = 1, . . . ,m, the meas- 
ure of association between genes in the gene set and 



the target (e.g. the t-statistic for binary Y), and as 
P = (Pi),i = 1, . . . ,m, the corresponding P- values. 
Similarly, association of genes in G " with the 
target will be represented by the vectors t and P c . 

In Table 1, we present competitive, self-contained 
and parametric methods of gene-set analysis. As we 
do not aim to evaluate all published methods, we 
only illustrate the groups by some popular or repre- 
sentative statistics. 

COMPETITIVE METHODS WITH 
RANDOMIZATION OF GENES OR 
SAMPLES 

Competitive methods aim to test the null hypothesis 
that genes in the gene set G are at most as often dif- 
ferentially expressed as the genes in G ' [12]. Rejection 
of the hypothesis (i.e. P-value < 0.05) indicates that the 
gene set includes significantly more differentially ex- 
pressed genes than the remaining collection of genes in 
the experiment, and therefore can be declared as 
activated. 

The basic idea of these methods is well illustrated 
by the Ql statistic proposed by Tian etal. [14]. The 
statistic aggregates the measures of association of 
individual genes in G with the target. Significance of 
the Ql statistic is calculated by permutation of genes 
(rows of the matrix W), with the P- value calculated as 

1 B 

P = -J2 I (QU> Ql) (i) 

where Ql, is the value of the Ql statistic recomputed 
for the gene set after the t'th permutation of genes, 
and B is the total number of permutations. 

Another group of competitive methods calculate 
the test statistic based on X and X c , and estimate its 
significance by permuting subjects (elements of Y). 
The most popular example is the GSEA method [1]. 
The statistic (enrichment score, ES) is the weighted 
Kolmogorov— Smirnov statistic comparing the ranks 
of genes in G with the uniform distribution. This is 
motivated by the observation that for the null hy- 
pothesis, the members of the gene set should be uni- 
formly distributed among all the genes (all the rows 
of W). Significance is estimated as in Equation (1), 
with permutation of samples instead of genes. 

Another interesting method is the GSA proposed 
by Efron and Tibshirani [9] . The method attempts to 
combine gene and sample randomization in one pro- 
cedure called 're-standardization'. For a gene set G, 
the method calculates the test statistic (gene set score) 
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Table I: Examples of competitive, self-contained and parametric methods 



Method 



Statistic 



Significance assessment 



Competitive with gene 
randomization 

Ql (Tian et al. [14]) 
FCS — Functional Class Score 
(Pavlidis et al. [24]) 
Competitive with sample 
randomization 
GSEA (Subramanian et al. [I]) 



GSA (Efron and Tibshirani [9]) 



SAFE (Barry et al. [10]) 

Self-contained 

Globaltest (Goeman et al. [5]) 



Q2 (Tian et al. [14]) 
FCS.SC — self-contained version 
of FCS (Pavlidis et al. [24]) 
ES.SC — self-contained version 
of the enrichment 
score (Subramanian et al. [I]) 
Parametric 

PAGE (Kim and Volksy [15]) 



CATEGORY (Jiang eto/. [25], 
Irizarry et al. [16]) 
Other parametric tests 
proposed by Irizarry et al. [16] 



Ql = 

FCS : 



±E£it. 
^Er=,-i°g(P<) 



Kolmogorov-Smirnov statistic comparing ranks 
of P-values of genes in gene set versus uniform 
distribution 

The maxmean statistic S max equal: 



'<'' <0) '' 



Kolmogorov-Smirnov or Wilcoxon rank-sum 
statistic comparing t versus t c 



Q = ^Eii ^[ X -( Y " H)Twhere u and u 2 are 
the are the first and second central moment of Y 



Same as Ql 
Same as FCS 

Kolmogorov-Smirnov statistic comparing 
(Pj)ji = versus the uniform distribution 



z = 5(l-L- u c)Vrn 

where u, S are the mean and standard deviation 
of fold changes calculated for all genes, and u G is 
the mean of fold changes for genes in G 



Wilcoxon rank-sum statistic or % statistic 
comparing t versus t 



Gene randomization 
Gene randomization 



Sample randomization 



Sample randomization for standardized 
test statistics 

Sample randomization 



Asymptotic normal distribution, or for small 
sample approximation by scaled x 2 
distribution, or randomization of samples 

Sample randomization 

Sample randomization 

Analytical Kolmogorov-Smirnov distribution 
or randomization of samples 



Null distribution ofz~N(0,l) 

Null distribution ofz~N(0,l) 
Corresponding analytical distributions 



S as, preferably, the maxmean statistic S max 
(Table 1); however, simpler choices are also con- 
sidered (such as the mean of individual gene scores 
in G, essentially as in the Ql statistic). Significance is 
estimated using permutation of samples; however, 
the test statistic is standardized using gene scores cal- 
culated for all the genes in the study. Specifically, the 
P-value is calculated as: 

1 x— v / 5, — mean* S — meatis\ , . 

P = -)I : > : -) (2) 

Bjr( \ stdev* stdevs J 

where S, is the gene set score for the i'th permutation 
of samples, means and stdev $ are the mean and stan- 
dard deviation of individual gene scores for all genes 
in the study, and mean* and stdev* are the mean and 
standard deviation of individual gene scores 
calculated over all genes in the study and a large 
number of permutations [9]. Standardization of 
the gene-set scores makes this method competitive, 



i.e. allows us to compare scores of genes in G with 
those outside G. Strictly, the Equation (2) is 
used if the test statistic S is defined as the mean 
gene score or mean absolute gene score. For 
the maxmean statistic S max = max\s^ + \s^~^, as 
defined in Table 1, the re-standardized statistic S** ax 
is calculated as S** ax = max^s^** , with 

^(+)** _ meatls _|_ s tdev — me an*) /stdev*, where 

is the element 5^ + ' of the maxmean statistic cal- 
culated for the (th permutation of samples (the same 
re-standardization applies for Then the 

P-value is calculated as the fraction of sample per- 
mutations in which S** > S max . 



SELF-CONTAINED METHODS 

Self-contained methods test the null hypothesis that 
no genes in the gene set are associated with the 
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target. The hypothesis is verified based on the data 
(X.i,Yi),i = 1, . . . ,n while ignoring X ( '. 

A prominent example is the Globaltest [5] based on 
generalized linear model of the relationship between 
X and Y, with the null hypothesis assuming that all 
the coefficients in the model are zero. The Globaltest 
is derived as the score test, which guarantees high 
power for gene sets with many genes moderately 
associated with the target. Similar multivariate pro- 
cedure was proposed by Mansmann and Meister [7]. 

Other self-contained approaches define the test 
statistic as the aggregate of individual gene scores 
(such as the Q2), with the null distribution derived 
by permutation of samples. Following this idea, a 
modification of popular competitive methods can 
be proposed that makes them self-contained. For 
instance, the FCS statistic [24] with sample-random- 
ization null distribution becomes self-contained, and 
the ES statistic based on raw P- values instead of ranks 
also becomes self-contained. 

Other self-contained approaches were proposed 
by Fridley etal, Dinu et oil. and Kong etal. [8, 26, 27]. 

PARAMETRIC METHODS 

Several authors criticized empirical null distribution 
of the GSEA or related methods and argued that 
simpler statistics can be used for which parametric 
null distributions are known. For instance, Irizarry 
etal. [16] proposed to directly compare the vectors 
of associations with the target t and t using more 
powerful tests for location and scale instead of the 
K— S test, with their analytical null distributions 
rather than empirical distributions. Similar idea was 
suggested by Jiang and Gentleman [25] . 

Another example of the parametric approach is 
the PAGE [15]. The method calculates the mean 
fold change for all genes in the experiment and com- 
pares this mean with the mean fold change for the 
genes in G. Under the null hypothesis, the 2-score 
should follow the normal distribution, which is used 
to estimate the P-value. 



MODELS OF STATISTICAL 
EXPERIMENT ASSUMED BY 
DIFFERENT METHODS 

Although different in nature, virtually all methods of 
gene-set analysis interpret their results in terms of 
significance or P-valnes — the concepts that are mean- 
ingful in the context of statistical hypothesis testing. 



To correctly interpret the results, it is essential that 
the underlying statistical assumptions are clearly 
stated regarding: the null and alternative hypotheses, 
the sample (size and independence) and derivation of 
the null distribution of the test statistic. In this sec- 
tion, we analyse the models of statistical experiment 
explicitly or implicitly assumed by different gene-set 
analysis methods. We use here the notation intro- 
duced in the 'Methods' section. 

Model 1 

In this model, we analyse association between two 
random variables: y, phenotype of the subject (pa- 
tient), and X e R m , expression of genes in set G. We 
consider the experiment data X, Y as the sample of n 
independent realizations of these variables: 

(X. u Y 1 ),...,(X. n ,Y n ) (3) 

The null hypothesis assumes that X and 3^ are 
independent. 

If T is the test statistic, then the null distribution of 
the statistic can be non-parametrically estimated by 
re-calculating the statistic based on (many) samples as 
in Equation (3), with the values of Y randomly 
permuted. 

It should be noted that the permutation null dis- 
tribution of T is available only when Y\ , . . . , Y„ can 
be considered independent samples from y, which is 
not true for time-series studies. 

This model of statistical experiment underlies the 
following gene-set analysis methods (Table 1): 

• Globaltest, 

• Q2, 

• FCS.SC, 

• ES.SC with the null distribution obtained by ran- 
domization of samples. 

Indeed, these methods calculate the test statistic 
based on expression of genes G, X,, which are sam- 
ples from X, with significance calculated using the 
null distribution obtained by calculating (many) 
values of the statistic under permutations of sample 
labels. Thus, by using the sample randomization, we 
de facto assume the null hypothesis that expression of 
genes for a subject X and the subject's phenotype 
y is independent, as stated in Model 1. 

Note, however, that this model does not apply to 
the ES.SC when the analytical null distribution is 
used. Model 2 is appropriate for this case. 
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Model 2 

In this model, we analyse association between the 
random variables: y, phenotype of the subject, 
and X e R, expression of a gene in the gene set G 
for the subject. We assume that expression of m 
genes in G measured for a subject i, denoted 
X.j = (X\j, . . . ,X m J) , consists of m independent 
samples from X, i.e. we assume that the genes in 
G are independent and governed by the same 
distribution. 

The null hypothesis assumes that X and y are 
independent. 

The experiment data X, Y represents results of 
m independent tests each with n samples from 
X and y. 

(x u ,Y x ),...,(x Un ,Y n ) 

(4) 

(X mt i,Y\), . . . ,(X m> „,Y„) 

These tests produce t = (t\, . . . ,t m ) — m independ- 
ent gene scores, i.e. measures of association between 
X and y, with their corresponding P-values 
P = (P u . . . ,P m ). 

Under the null hypothesis in this model, the 
P-values come from the Uniform(0,l) distribution. 
Hence, any gene-set analysis method that tests for 
uniformity of P ; , or for some other analytical distri- 
bution of the combination of independent /-statistics, 
or P-values calculated for genes in the gene set 
G actually realizes this model of statistical experi- 
ment. Examples of such methods include: 

• ES.SC with the analytical null distribution 
(Table 1), 

• The parametric method proposed by Irizarry et al. 
and Jiang and Gentleman [16, 25] (CATEGORY 
in Table 1), 

• Fisher's method for combining independent 
P-values, using the X 2 nu h distribution [8], 

• Testing for normality of t { (as proposed by Irizarry 
etal. [16]). 

The last two methods, suggested in literature, are 
not reported in Table 1. 

All these methods realize Model 2, as they 
consider the values t\ , . . . ,t m (or Pi , . . . ,P,„) as inde- 
pendent samples from some distribution, and test 
whether the sample comes from the null distribution 
that is based on the assumption that the m tests which 
produced t\, . . . ,t m or P\, . . . ,P m tested independent 



variables: X (which represent expression of a 
given gene in G) and y (phenotype). Hence, by 
using this procedure to estimate significance, we de 
facto assume that genes in G are iid, as stated 
in Model 2. 

Model 3 

In this model, we consider the data 
t = (t{),i = l,...,m and t c =(t?),i = 1, . . . ,{d — m) 
as iid samples drawn from two unknown distributions 
denoted T and T . Although it is hard to interpret 
these random variables in the context of the actual 
biological study, we loosely describe T and T ' as 
association of genes in G and outside G, respectively, 
with the target. 

Different null hypotheses are defined in this 
model, e.g. that the variables T and T ' have the 
same distribution, or that they do not differ in some 
specific parameter (e.g. mean). 

The gene-set analysis methods that are based on 
this model use different tests to compare t and t c and 
different procedures to obtain the null distribution of 
the test statistic. Examples include (Table 1): 

• Statistical tests directly comparing t and t C (e.g. 
using the f-test, Wilcoxon rank-sum, % 2 tests), as 
proposed by Irizarry etal. [16], 

• PAGE [15], 

• Competitive methods with gene randomization, 
such as the Ql or FCS. 

It should be noted that although the Ql or FCS 
statistics are based only on t and do not take t into 
account, the methods are competitive because of the 
gene randomization procedure used to derive the 
null distribution. Indeed, if the gene permutation- 
based distribution is used as the null distribution to 
obtain the P-value as in Equation (1), then this is 
equivalent to the assumption that t and ( have 
been drawn from the same distribution. 
Additionally, by using elements of t and elements 
of t ' as samples from their underlying distributions, 
we defacto assume that elements of t are independent, 
and elements of t are independent, which further 
implies independence of genes. 

Model 4 

In this model, we consider the experiment data Y, X 
and X c as n independent realizations of the random 
variables y, X € R m and X c e R^-">\ which rep- 
resent phenotype of the subject, the subject's 
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expression of genes in G and expression of genes not 
in G: 

(X. u X c l ,Y 1 ),...,(X. n ,X c n ,Y n ) (5) 

The test statistic is calculated based on X and Xp, 
with its null distribution obtained by randomization 
of samples, as in Model 1. This procedure to obtain 
the null distribution implies that the actual null 
hypothesis assumes that X and 3^ are independent 
and X and y are independent. 

This model of statistical experiment is valid for 
competitive methods with randomization of samples, 
such as: 

• GSEA, 

• SAFE. 

This model differs from Model 1 in how the test 
statistic is constructed: in Model 1 , it is based only on 
X, whereas in Model 4, it includes X and Xr. 
Note that by permuting sample labels assigned to 
the data vectors X.i,X c f to calculate the null distri- 
bution of the test statistic, we de facto assume inde- 
pendence of the phenotype and genes in G and also 
independence of the phenotype and genes in G . 

Finally, we want to analyse the model of statistical 
experiment realized by the GSA method. The 
method uses sample permutation test [Equation (2)] 
to empirically obtain the null distribution of the test 
statistic. Referring to Equation (2), we consider the 
standardized score (S — means) /stdev s as the test stat- 
istic, whose significance we want to obtain non- 
parametrically through sample permutations. Note 
that this statistic compares differential expression of 
genes in G with the mean level of differential 
expression observed for all genes (hence, the statistic 
is competitive). Significance of this statistic could be 
obtained by calculating many possible values of this 
statistic using sample permutations: 

1 v^-* ( S; — means: S — mean A 
P2 = -)I[- — -> - (6) 

B \ staevsi stdevs ) 

where mean$i and stdevsi denotes the mean and 
standard deviation of the gene scores over all genes 
calculated for the ith permutation of samples. Small 
P2 leads to rejection of HO of no association of ex- 
pression in subjects and their class labels, which we 
interpret that G is enhanced (contains more differ- 
entially expressed genes than expected by looking at 
the proportion of DE genes in the list of all genes), as 



this is what (S — means) /stdevs actually measures. 
However, we note that, strictly, the original GSA 
method [Equation (2)] does not produce the permu- 
tation-based null distribution of the test statistic 
{S — means)/ stdevs needed to derive the P-value. 
Instead, the method produces the null distribution 
of (S — mean*) / stdev* , which complicates rigid inter- 
pretation of the P- value produced by Equation (2). 
We will return to this in 'Empirical Comparison' 
section. 



VALIDITY OF THE MODELS IN THE 
CONTEXT OF BIOLOGICAL STUDY 

Significant result (i.e. P-value P<0.05) obtained by 
any gene-set analysis method means that if the actual 
null hypothesis of the particular method was true, 
then only the fraction of P repetitions of the experi- 
ment would return the data (i.e. the test statistic) 
more extreme than actually observed. Obviously, 
repetition of the experiment as perceived by the 
biologist consists in taking a new sample of subjects 
(e.g. patients) and taking a new sample of measure- 
ments from these subjects, as described in Equations 
(3) or (5). 

The research question posed by the biologist is 
related to whether (i) expression of genes in G is 
associated with the phenotype, or (ii) frequency of 
differentially expressed genes in G is higher than in 
G . Question (i) is related to the self-contained null 
hypothesis, whereas question (ii) is related to the 
competitive null hypothesis, as formulated by 
Goeman and Buehlmann [12]. 

Analysing models of the statistical experiment de- 
veloped in the previous section in the context of the 
biological experiment, we make the following 
observations: 

(1) Models 1 and 4 directly correspond to the 
organization of the biological experiment, 
where the sample [considering Model 1, 
Equation (3)] represents expression of m genes 
in G measured for n independent subjects 
(X.;,i = 1, . . . ,n), together with the subjects 
phenotypes (Y;,t = 1, . . . ,n). P-value P pro- 
duced by tests under Model 1 have clear biolo- 
gical interpretation described earlier in the text, 
where repeating experiments can be realized by 
taking new samples from X and y, i.e. testing 
new patients. Significant result declared by gen- 
e-set analysis methods based on Model 1 
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indicates that the gene set concerned contains 
genes associated with the phenotype. Significant 
result according to Model 4 indicates that either 
the gene set or its complement contain genes 
associated with the phenotype. Hence, the meth- 
ods based on Model 1 directly answer the research 
question (i). If we make (often reasonable) as- 
sumption that (most of) genes in G c are not dif- 
ferentially expressed, then methods based on 
Model 4 also answer the research question (i). 

(2) Model 2 also closely mimics organization of the 
biological experiment in terms of how a sample 
is taken and how experiment is repeated (which 
is essentially done as in Model 1). However, 
gene-set analysis methods based on this model 
rely on the assumption that genes in G are inde- 
pendent and identically distributed — both of 
which seems highly unrealistic in expression stu- 
dies. However, if the assumptions were true, 
then methods based on Model 2 would answer 
the research question (i). 

(3) Model 3 compares two samples of size m and 
d—m, where genes are the sampling units. In 
this model, we compare distributions of two un- 
known random variables (denoted previously T 
and T ), which produced the samples t and f . 
As we use statistical tests to compare T and T , 
we implicitly assume that elements in each of the 
samples t and t ' are iid, which implies that 
expression of genes in G are independent and 
identically distributed (which can be weakened 
for gene permutation procedures, as we only re- 
quire that genes are exchangeable). The most 
fundamental problem with interpretation of this 
model is the biological meaning of the variables 
T and T c . It is also not clear how repeating the 
biological experiment (i.e. testing new patients) 
could bring more samples from these variables. 

In summary, only methods based on Model 1 ad- 
dress the research question (i), which is related to the 
self-contained null hypothesis. Methods based on 
Model 4 (this group includes popular methods such 
as GSEA) also address the same question, although an 
additional assumption is made of no association of 
genes outside G with the phenotype. Methods based 
on Model 2 also test the self-contained null hypothesis, 
but they make the assumption that expression of genes 
in G is iid. Methods based on Model 3 compare two 
random variables with unclear biological meaning, 
and, additionally, they assume that genes in G are 



independent and identically distributed. This leads 
to the conclusion that none of the methods discussed in the 
previous section seems to address the research question ( ii), related 
to competitive null hypothesis, in statistically sound way. 

PROPOSED INTERPRETATION OF 
GENE SAMPLING PROCEDURES 

As the methods of gene-set analysis that use gene 
sampling to test the competitive null hypothesis are 
popular (see the review [4]), it seems compelling to 
find some interpretation of their results in line with 
the actual biological experiment. 

Here, we propose such interpretation. The sample 
in the model is a set of n subjects. For this sample, we 
measure association of expression of genes in the gene 
set G with the phenotype, denoted t. We consider t as 
one realization of a random variable U G R'" . In the 
same experiment, we measure association of genes 
outside G with the phenotype, denoted t . 

We want to assess whether G is enhanced, i.e. 
whether genes in G are more associated with the 
target than the remaining genes. To do this, we ran- 
domly draw many m-element subsets from t , 
denoted fm„ i=l,...,B. Now we can calculate a 
heuristic measure of enhancement of G defined as: 

s = ^2l(f(tmd>f(t)) (7) 

i=\ 

where / : R"' — > R, [e.g. fit) equals 7 or the mean 
absolute value of elements of t] . 

Note that s intuitively addresses the question for- 
mulated in the competitive null-hypothesis, where 
small values of 5 say that it is unlikely that a randomly 
selected gene set of size m composed of genes from 
G C realizes stronger association with the target 
than G, which suggests enhancement of the gene 
set G. 

However, we should be careful not to use the lan- 
guage of statistical hypothesis testing while interpret- 
ing 5, i.e. s is not a P-value, and s < 0.05 has nothing to 
do with 'statistical significance' of results. Result 
s < 0.05 means that only 5% of subsets (gene sets) of 
size m drawn from G c have as extreme t values (i.e. 
associations with the target) as observed in G. 

EMPIRICAL COMPARISON 

Genes-set analysis methods based on Model 2 and 3 
rely on the assumption that gene expressions are 
independent. In this section, we want to quantify 
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performance of different methods if data does not 
follow this assumption. In the first two experiments, 
we focus on the self-contained hypothesis, whereas 
in experiments 3 and 4, we focus on the competitive 
hypothesis. In experiment 1, we demonstrate the 
false-positive rate of different methods under varying 
correlation of genes in the gene set. In experiment 2, 
we compare the power of selected methods for the 
cases when only a small fraction of genes in the gene 
set are differentially expressed. We address the criti- 
cism formulated by some authors [4] that self- 
contained methods may demonstrate excessive sen- 
sitivity if only a few genes in the set are associated 
with the target. In experiment 3, we quantify the 
false-positive rate of different methods under 
the competitive hypothesis, where many gene sets 
are similarly differentially expressed, but one of the 
gene sets contains correlated genes. In experiment 4, 
we compare power of different competitive methods 
in the setting where many gene sets are differentially 
expressed, but one of the gene sets is significantly 
more differentially expressed. 

As our purpose is to compare performance of 
methods under known characteristics of data, we 
deliberately restrict this analysis to simulated data. 

In experiment 1, we quantify type I error of dif- 
ferent methods based on simple simulated data set 
with h=30 samples and d = 100 or d = 1000 genes, 
out of which m = 40 genes constitute a gene set G. 
In this study, no genes in G or in G C are differen- 
tially expressed, but genes in G are correlated. We 
generate expression for the genes in G from the 
multivariate normal distribution with the mean for 
each of the m genes equal 0 and the covariance 
matrix, which has diagonal elements equal 1 and 
non-diagonal elements equal r. Note that as variances 
of genes are 1, ris the correlation of genes in G. The 
remaining d — m genes come from N(0,1). 

In the experiment, we vary the parameter r. We 
repeat the experiment 500 times and observe the P- 
value produced by different methods. We record 
how many times (out of 500) each of the methods 
produces significant P-value (P< 0.05), which meas- 
ures the false-positive rate of each method. 

Results of this study for d= 1000 are summarized 
in Table 2. 

As the gene sets contained no differentially ex- 
pressed genes, the false-positive rate of 5% was 
expected. It can be clearly seen that the methods 
based on Models 2 and 3 generate excessive 
number of false-positive results, with the effect 



increasing with growing correlation between genes 
in G. Hence, gene sets declared as significant by 
these methods may only include correlated genes, 
with no genes associated with the target. Type I 
error of the methods based on Models 1 and 4 is 
not affected by the correlation in the gene set. We 
also observe that for correlated genes, the GSA 
method tends to produce slightly more false-positive 
results than expected. This may seem surprising 
considering the fact that the method uses sample 
randomization. However, the problem can be 
accounted for by observing that the GSA sample 
permutation procedure [Equation (2), valid for 
S being the mean gene score of the mean absolute 
gene score) does not actually produce the empir- 
ical null distribution of the test statistic 
(S — means)/ stdevs, which is required to estimate 
the P-value, but rather of the quantity 
(S — mean*) / stdev* . The same remark is true when 
the maxmean S max is used as the test statistic (as in 
our simulation studies). To confirm this, we repeated 
the simulation with the GSA P-value calculated ac- 
cording to Equation (6), with the test statistic taken 
as the mean gene score, and observed the expected 
Type I error of ~5%, whereas the original Type I 
error for this statistic also exceeded 5%. This mod- 
ified version of the GSA based on Equation (6) is 
included in our simulation studies as GSA2. 

We repeated the study for d=l00 and observed 
that Type I error decreased only for the PAGE and 
PAR Wilcoxon methods (e.g. the error for PAGE 
under correlation 0 to 0.8 decreased to 0.008, 0.052, 
0.154, 0.368, 0.568, respectively). For all the other 
methods, Type I error is not affected by the total 
number of genes d. 

In experiment 2, we assume that n.DE out of m 
genes in G is differentially expressed and correlated. 
We divide the samples into two groups of 15 samples 
and generate expression of the n. DE genes in the first 
group from the multivariate normal distribution with 
the mean 0, and in the second group with the mean 
equal A. The covariance matrix for both groups has 
diagonal elements equal 1 and non-diagonal elem- 
ents equal r. The remaining m — n. DE genes in G 
and the d—m genes in G c are not correlated and 
not differentially expressed and come from N(0,1). 

In this study, we compare the power of different 
methods as a function of the number of differentially 
expressed genes (n.DE=2, 5, 10, 40), the effect 
strength (A = 0.5, 1, 1.5) and correlation r. We 
measure power as the fraction of replications of the 
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Table 2: False-positive rates for the self-contained hypothesis for varying level of correlation in the gene set 
observed for different methods of gene set analysis 



Method 


Correlation r of genes 


in G 








0 


0.2 


0.4 


0.6 


0.8 


Ql 


0.05 


0.096 


0.178 


0.22 


0.282 


GSEA 


0.034 


0.07 


0.066 


0.062 


0.06 


GSA 


0.056 


0.072 


0.07 


0.086 


0.096 


GSA2 


0.05 


0.057 


0.047 


0.06 


0.05 


SAFE 


0.053 


0.043 


0.06 


0.04 


0.057 


Globaltest 


0.05 


0.036 


0.038 


0.062 


0.066 


Q2 


0.054 


0.034 


0.046 


0.044 


0.05 


ES.SC 


0.048 


0.052 


0.054 


0.052 


0.054 


ES.SC Analytic 


0.044 


0.13 


0.318 


0.556 


0.822 


PAGE 


0.03 


0.148 


0.32 


0.606 


0.702 


CATEGORY 


0.066 


0.482 


0.632 


0.682 


0.772 


PARWilcoxon 


0.056 


0.516 


0.644 


0.716 


0.802 



Methods are denoted as inTable I. The following specific settings of the methods were used: SAFE uses the Wilcoxon statistic; Globaltest uses the 
asymptotic null distribution; ES.SC uses null distribution based on permutation of samples; ES.SC Analytic uses the Kolmogorov-Smirnov null distri- 
bution; GSA uses all genes in the data set for re-standardization; GSA2 is the modified version of GSA with the P-value based on Equation (6). 
PARWilcoxon is the parametric method suggested by Irizarry eta/. [16], see last row inTable I. 



experiment, which produce P-value < 0.05, i.e. 
declare the gene set as significant. 

Results are summarized in Figures 1—3. 

We focus only on the methods that use sample 
randomization: Globaltest, FCS.SC, ES.SC (methods 
based on Model 1), GSEA, SAFE (methods based on 
Model 4) and GSA. We do not consider methods 
based on gene randomization or parametric models 
(i.e. methods based on Models 2 and 3), as they tend 
to produce excessive number of false-positive results 
from data with only correlated but no differentially 
expressed genes. 

It can be clearly seen that if only a small fraction of 
genes in the gene set is associated with the pheno- 
type, then the methods considerably differ in power, 
with the Globaltest, GSEA and GSA consistently 
showing higher power than the remaining methods 
(e.g. Figure 3, bottom right panel). We also observe 
that for small effect (Figure 1), GSA demonstrates 
slightly better power than other methods, whereas 
for medium (Figure 2) and strong effects (Figure 3), 
the Globaltest outperforms other methods. Our 
study also confirms that the Globaltest tends to de- 
clare as significant the gene sets with few differen- 
tially expressed genes with strong effect (Figure 3, 
bottom left panel). This high power is not necessarily 
a desirable feature, if the purpose of analysis is to find 
gene sets with moderate, but consistent, effect 
observed over many genes. Finally, we observed sen- 
sitivity of the results to the total number of genes d. 



In Figure 4, we compare power of competitive 
methods for d=1000 and d=100 (we omit the 
self-contained methods, as their power is obviously 
not affected by d). We observe that increasing d (with 
m fixed) leads to higher power of competitive 
methods. 

In experiment 3, we analyse the type I error under 
the competitive hypothesis. We focus here only on 
competitive or parametric methods that compare ex- 
pression in G and in G , such as Ql, GSA, GSA2, 
GSEA, SAFE, PAGE and PAR Wilcoxon (Tables 1 
and 2). We generated data with 30 samples divided 
into two groups of 15 samples, and 1000 genes 
divided into 50 gene sets each with 20 genes. We 
generated expression of all the genes from N(0,1), 
then in 30% randomly selected genes we added 
A = 1 in the second group of samples. Hence, all 
the gene sets contained on average 30% (i.e. 
n.DE=6) differentially expressed genes. In the first 
gene set (GS1), we allowed for correlation among 
the genes (i.e. expression in GS1 was generated 
from the multivariate normal distribution, as in the 
second experiment, with the effect size A = 1 and 
n.DE—6). In this study, we varied the correlation 
in GS1. We repeated the experiment 200 times and 
reported the fraction of replications of the experiment 
that produced P-value < 0.05 for GS1, i.e. declared 
the first gene set as significant in comparison with 
the remaining sets. This measures the type I error of 
the methods. Results for 30% (n.DE=6) and 60% 
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Figure I: Power of selected methods as a function of correlation and the number of differentially expressed genes 
(n.DE) in the gene set. Small effect, A = 0.5. Number of genes d = 1000. 



(n.DE — 12) differentially expressed genes in all gene 
sets are summarized in Table 3. We observe that for 
uncorrelated genes in GS1, all the methods (with the 
exception of GSEA) control the type I error. 
However, correlated genes lead to increased type I 
error for the gene randomization (Ql) and parametric 
methods (PAGE and PAR Wilcoxon). 

In experiment 4, we want to compare power of 
different methods when the data contains many dif- 
ferentially expressed gene sets, with some of the gene 
sets significantly more differentially expressed. We 
focus here only on the methods analysed in experi- 
ment 3. We used the same data as in experiment 3 
with 30% of differentially expressed genes in the 
gene sets 2 through 50; however, we varied the 
number of differentially expressed genes in GS1 
{n.DE— 6, 12, 18). We repeated the experiment 
200 times and reported the fraction of replications 
of the experiment, which produced P-value < 0.05 
for GS1, i.e. declared the first gene set as significant 
in comparison with the remaining sets (we refer to 
this as 'power' of the methods). Results are summar- 
ized in Figure 5. We observe that for n.DE= 6, the 



GS1 is not regarded as significantly enriched by most 
of the methods, even if the differentially expressed 
genes are correlated. Only the GSEA tends to declare 
GS1 as significant (in 10—20% replications of the ex- 
periment), as it probably measures both association 
with the target and enrichment. As expected, with 
increasing n.DE, the methods recognize GS1 as en- 
riched; however, in our study, GSEA seems to have 
highest power (Figure 5, bottom left panel). 
Interestingly, if the differentially expressed genes 
are correlated, GSA, GSEA and SAFE seem to lose 
power. We observe that GSA2 [the modified version 
of GSA based on Equation (6)] generally realized 
higher power than the original GSA. To illustrate 
peculiarities of some test statistics, we also demon- 
strated the case where the gene set of interest con- 
tains fewer genes associated with the target then the 
remaining gene sets (point n.DE=0 in Figure 5). 
We observe that PAGE, SAFE and PAR 
Wilcoxon (proposed by Irizarry et at. [16]) declare 
GS1 as significant, although it contains no differen- 
tially expressed genes. Clearly, these methods signal 
that the pattern of expression in the gene set is 
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Figure 2: Power of selected methods as a function of correlation and the number of differentially expressed genes 
(n.DE) in the gene set. Medium effect, A = I. Number of genes d — 1000. 
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Figure 3: Power of selected methods as a function of correlation and the number of differentially expressed genes 
(n.DE) in the gene set. Strong effect, A = 1.5. Number of genes d = 1000. 
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Figure 4: Power of sample randomization, competitive methods for different number of genes in the study d = 100 
or d = 1000, as a function of correlation and the number of differentially expressed genes (n.DE) in the gene set. 
Medium effect, A = I. 



Table 3: False-positive rates for the competitive hypothesis as a function of the average number of differentially 
expressed genes in all the gene sets (n.DE) and correlation in GSI 



Method 


n.DE 


Correlation r of 


genes in GSI 








0 


0.2 


0.4 


0.6 


0.8 


Ql 


6 


0.005 


0.01 


0.055 


0.035 


0.095 


GSEA 




0.15 


0.205 


0.155 


0.175 


0.16 


GSA 




0.01 


0.01 


0.005 


0 


0.005 


GSA2 




0.01 


0 


0 


0.005 


0 


SAFE 




0.01 


0.01 


0 


0 


0 


PAGE 




0 


0.005 


0.035 


0.045 


0.09 


PARWilcoxon 




0.005 


0 


0.02 


0.01 


0.01 


Ql 


12 


0.01 


0.065 


0.09 


0.115 


0.175 


GSEA 




0.04 


0.085 


0.075 


0.09 


0.075 


GSA 




0.02 


0.02 


0.005 


0.03 


0.005 


GSA2 




0 


0.02 


0.025 


0.02 


0.01 


SAFE 




0.005 


0.01 


0.01 


0.025 


0.035 


PAGE 




0.005 


0.07 


0.15 


0.26 


0.285 


PARWilcoxon 




0.005 


0.02 


0.115 


0.175 


0.245 
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different than in the remaining sets, and not 
necessarily that the gene set is enhanced. 

CONCLUSION 

In this article, we analysed the models of statistical 
experiment that underlie commonly used methods 
of gene-set analysis. We showed that only the 
self-contained methods and competitive methods 
that use sample randomization to generate the 
null distribution are based on the models of experi- 
ment that closely follow organization of the actual 
biological study. Thus, these methods produce stat- 
istically sound results, albeit under slightly different 
null hypotheses. Other gene-set analysis methods, 
those based on simplifying parametric assumptions 
or on gene randomization, either rely on unrealistic 
assumptions pertaining to distribution of genes 
or, additionally, compare random variables, whose 
biological interpretation is unclear. We must 
also note that popular competitive methods 
(such as GSEA or SAFE) do not, strictly, test the 
competitive null hypothesis, as formulated by 
Goeman etal. A significant result from these methods 



does not necessarily mean that the gene set of interest 
contains more genes associated with the phenotype 
than its complement, but it rather means that either 
the gene set or its complement are associated with 
the phenotype. 

Although we do not want to recommend one, best 
method of gene set analysis, based on this work we 
could suggest the Globaltest, GSA or GSEA as the 
preferable tools for gene-set analysis studies. These 
methods produce biologically interpretable P-values 
and realize higher power than other sample- 
randomization methods compared in this article. We 
observe, however, that significant results from the 
Globaltest may be because of only a few genes 
strongly associated with the target. GSA or GSEA 
methods seem to be less powerful for such data, 
which seems to be an advantage of these methods. 

Finally, we want to point that when comparing 
results produced by different methods or considering 
the application of some methods, it seems advisable 
to consider the actual statistical model implied by the 
method, as small P-values produced by gene- 
randomization or parametric methods do not neces- 
sarily imply that the gene set is enhanced. 
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Key Points 

• Out of the vast collection of gene-set analysis approaches, only 
the self-contained methods and competitive methods that use 
sample randomization to generate the null distribution are 
based on the models of experiment, which closely follow organ- 
ization of the actual biological study. 

• Popular competitive methods (such as GSEA or SAFE) do not 
test the competitive null hypothesis, as formulated by Goeman 
etal. A significant result from these methods does not necessar- 
ily mean that the gene set of interest contains more genes 
associated with the phenotype than its complement, but it 
rather means that either the gene set or its complement are 
associated with the phenotype. 

• Competitive methods that use gene randomization can produce 
biologically meaningful results, although the results cannot be 
formulated using the concept of statistical significance or a 
P-value. A heuristic measure of similarity of differential expres- 
sion patterns in the gene set and its complement can be used 
instead. 
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